A state-of-the-art methodology for high-throughput in silico vaccine discovery against protozoan parasites and exemplified with discovered candidates for Toxoplasma gondii

Vaccine discovery against eukaryotic parasites is not trivial as highlighted by the limited number of known vaccines compared to the number of protozoal diseases that need one. Only three of 17 priority diseases have commercial vaccines. Live and attenuated vaccines have proved to be more effective than subunit vaccines but adversely pose more unacceptable risks. One promising approach for subunit vaccines is in silico vaccine discovery, which predicts protein vaccine candidates given thousands of target organism protein sequences. This approach, nonetheless, is an overarching concept with no standardised guidebook on implementation. No known subunit vaccines against protozoan parasites exist as a result of this approach, and consequently none to emulate. The study goal was to combine current in silico discovery knowledge specific to protozoan parasites and develop a workflow representing a state-of-the-art approach. This approach reflectively integrates a parasite’s biology, a host's immune system defences, and importantly, bioinformatics programs needed to predict vaccine candidates. To demonstrate the workflow effectiveness, every Toxoplasma gondii protein was ranked in its capacity to provide long-term protective immunity. Although testing in animal models is required to validate these predictions, most of the top ranked candidates are supported by publications reinforcing our confidence in the approach.


Background
First assumption is that once the pathogen 'invades' a nucleated cell and resides in the parasitophorous vacuole, it is beyond the reach of the immune system. That it, the parasite peptides 'will not' be presented on MHC I molecules. Second assumption is that given the opportunity, professional antigen presenting cells (APCs) (e.g., macrophages, dendritic cells) will engulf and digest the pathogen, and present peptides on MHC II molecules for the benefit of helper T cells. That is, the engulfed pathogen proteins are proteolytically fragmented into peptides.
A proposed strategy to improve T-cell epitope predictions is to first predict the fragmented peptides (by predicting the cleavage sites), and then predict the binding affinity to MHC II molecules using existing predictors. This strategy also proposes that every pathogen protein is processed for cleavage sites, irrespective of subcellular location.
One uncertainty in the approach is that it is not clear whether the fragmentation of a protein via a host's (e.g., human) proteasome is the same each time. In other words, do we always end up with the same set of peptide fragments given the same pathogen? Another uncertainty is the use of peptides in a multi-epitope vaccine. As an example, assume peptides are predicted that have a high binding affinity to MHC II molecules, irrespective of the prediction method. In the typical vaccinology workflow, these peptide sequences are linked together along with an adjuvant sequence to form a vaccine construct. Our expectation is that it is the adjuvant providing the pathogen-associated molecular patterns (PAMPs) (and not the peptide candidates) that are recognised by the APCs. In such a case, the vaccine construct once engulfed by the APCs would be fragmented. What is not clear is whether the intended peptide fragments would be retained. It is expected that they will not be because the original proteasomal cleavage sites on the parent proteins no longer exist in the vaccine construct. Our current leaning for obtaining a cell-mediated response is to include entire proteins in a vaccine formulation, and in particular proteins that contain a high density of peptides with high binding affinity. The expectation is that some of the desired proteolytically fragmented peptides will be presented for T-cell helper inspection.

Testing existing programs that predict proteasomal cleavage sites
Only one program could be found that predicts which peptides are naturally processed by the MHC class II antigen presenting pathway (there are more programs available for MHC I e.g., Netchop). The one MHC II program is called NetCleave and as quoted in the publication, "predictions achieve great predictive power towards class I isotypes (AUC ~ 0.92) and modest predictive power towards class II isotypes (AUC ~ 0.66)". Predictions of 66% for MHC II are discouraging because this poor accuracy will obviously have a detrimental knock-on effect on other in silico vaccine discovery steps.
For testing existing tools, known peptides were downloaded from IEDB (see Supplementary  Tables for Data S1, sheet [IEDB_epitopes]). Disappointingly, there are only 23 peptides from 15 Toxoplasma gondii proteins (see below Appendix A: Test_analysis). And even some of these peptides are questionable e.g., the peptides are the outcome from 13 publications and five are more than 20 years old.
The only program that specifically predicted cleavage sites for MHC II ligands, given an antigen sequence, was MHCII-NP (see Appendix A: Test_analysis). This program, however, did not correctly predict any of the cleavage sites of the 23 T. gondii peptides (see Appendix B -MHCII-NP_results). It is unclear if the test peptides are correct or the program is simply inaccurate. As a further test, two human MHCII peptides from IEDB were tried, but the cleavage sites were still incorrectly predicted by MHCII-NP. MHCII-NP uses previously identified motifs from 10 residues prior and beyond the N-and C-terminuses of known peptides to predict the cleavage sites (no machine learning is used). NetCleave (developed in 2021) uses neural networks to predict C-terminal cleavage sites. However, it strangely does not take antigen sequences for the input, which seems to defeat the purpose of the predictor i.e., the input is only seven residues and it makes a prediction on the assumption that four residues occur before the cleavage site and three after. Furthermore, one needs to specify one of four MHC II allele classes (HLA, HLA_DP, HLA_DQ, HLA_DR) as part of the input. There are other recent programs like ITCell (see Appendix A: Test_analysis) that predict cleavage sites as part of their strategy for predicting T-cell epitopes but do not show the cleavage sites in their results.

Developing a program to predict cleavage sites
It appears that current peptide-MHC binding predictors do not take in to account MHC ligands/fragments as a result of natural antigen processing. Moreover, it is not clear whether the lengths of amino acids 'predicted' as peptides exactly correspond to a fragment length generated by proteolytic break down of the protein.

Proposed methodology #1
Known MHC II peptides will be downloaded from IEDB. Ideally, the host organism for the antigen processing should be 'human' and the source of the antigen, an infectious organism.
Given the source antigen sequence of the peptide in the following context: N-terminal sequence + peptide sequence + C-terminal sequence Note: a MHCII peptide sequence is typically 12 to 25 amino acids (AAs) in length, but potentially can be an entire protein. Peptide binding to MHCII is mainly determined by interactions within a limited sized binding groove, which means a consecutive stretch of only 9 AAs of the peptide typically does the binding (i.e., the 'binding core'). Residues protruding from either side of the binding groove are commonly known as peptide flanking regions (PFRs). Note that peptides are possibly trimmed for presentation after protein is fragmented, which may complicate predictions. The source antigen in the context of the above characteristics would be: Variable N-terminal sequence + variable PFR + 9AAs binding core + variable PFR + variable C-terminal sequence A proposed training data sequence for the N-terminal: 10AAs preceding peptide + first 3AAs from PFR A proposed training data sequence for the C-terminal: Last 3AAs from PFR + 10AAs following peptide Note: This means that in the training data only peptides >= 15 are used (e.g. first 3AAs from PFR + variable number of AAs from PFR + 9 AAs binding core + last 3AAs from PFR + variable number of AAs from PFR) Here is an example in machine learning (ML) terms (except letters will be converted to a preset digital value):  ID,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,PFR1,PFR2,PFR3,Target   C-Terminal ML header: ID, PFR1,PFR2,PFR3,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,Target For the negative training sequences, all the positive training sequences are used but randomly shuffled. Different ML algorithms and models will be trained and evaluated to see which one performs the best on test data. The expectation is that there will be one ML model for the Nterminal and one for the C-terminal cleavage site.
Here is an example of how the input sequences for predicting the cleavage sites are represented.

Example sequence: MTTSASSHLNKGIKQVYMSLPQGEKVQAMYIWIDGTGEGLRCKTRTLDSEPKCVEE LPEW
The sequence is broken up in accordance to a fixed sliding window of 13 AAs. Hence the sequences for the N-terminal cleavage sites model will be: The sequences for the C-terminal cleavage sites model will be: Following the predictions, there will be a probability score for the classification correctness of the sequences N10, N11, N12 etc. and C3, C4, C5 etc. Each sequence has a number associated with its identifier. This number represents the position in the antigen sequence where the cleavage is expected to occur.
The aim is to take a high-scoring N sequence and then find its nearest high-scoring C sequence. If the length between the N and the C (given their sequence identifiers) is >= 15 and =<25 then the sequence between the two sites will be flagged as a possible ligand.

Proposed methodology #2
The strongest motif signals in the vicinity of cleavage sites are in the PFR regions. An alternative method is to use different ML models for different peptide lengths e.g., one model trained, as above, on only peptides of length 15 and 16 (with 3AAs from PFR in the training sequence), and then another model for peptides of length 17 and 18 (with 4AAs from PFR in the training sequence), and then one for 19 and 20 (with 5AAs from PFR), and so on. This is still based on the assumption that the binding core is 9AAs.
The input sequences would be broken up as before but with a 13 AA fixed sliding window, then a 14AA window, then a 15AA, and so on. The different sets of sequence lengths from the same source antigen would be processed with the corresponding models e.g., 15-16 model, 17-18 model, and 19-20 model. The length thresholds between the highest-scoring N and the C sites would therefore be '>14 and <17', and '> 16 and <19' for the 15-16 model and 17-18 model, respectively. The final outcome would be a list of ligands ranked on their cleavage site scores.

Results
Using the methodology #1 previously described, two training datasets representing over 50,000 MHC II peptides of length 15, one for the N-and the other for the C-terminal were created. The most frequent length for a MHC II peptide is 15 AAs (see Supplementary Tables for Data S1, sheet [Peptide Lengths 8 to 26 AAs]). The peptide lengths were compiled from 470,217 epitopes associated with 18,514 parent proteins downloaded from IEDB (See Appendix C: Predicting cleavage sites for more details) Disappointingly, the ML prediction accuracy averaged only around 57% using 70% / 30% data split for testing. The algorithms used with default parameters were Random Forest, SVM, and neural networks using). One promising aspect is that there are clear signals to exploit. Refer to Supplementary Tables for Data S1, sheet [AA_frequencies]. This sheet shows the frequency of the 20 amino acids (AA) given 17,944 parent proteins of 52,976 peptides of length 15AAs e.g., 9.94% are 'L', 8.21% are 'S'. Sheet [N_L15_Pos_Context10_PFR3] shows in descending order the ratio of observed and expected frequencies at positions P1-P10 (the context) representing 10AA before the N-Terminal of the peptide, and positions PFR1-3 representing the flanking regions of the core binding region (typically 9AAs). For example, a ratio of 1.0 is exactly the frequencies you would expect by chance at a specific position. AAs at positions that are well above expectation are denoted by +++ or below expectation by ---). One can clearly see that the most informative positions are P10 (the expected cleavage site) and the flanking regions. To support these finding [N_L15_Neg_Context10_PFR3] shows the frequencies if the AAs in the 52,976 peptides are randomly shuffled. Similar results for the C-Terminal end were observed (see sheets [C_L15_Pos_PFR3_Context10] and [C_L15_Neg_PFR3_Context10]. To show that there are no informative positions before P1-P10, the context was changed to 20 (see sheet [N_L15_Pos_Context20 PFR9]).
The challenge is that although there appears to be signals at P10 and PFR1-3, there is also a huge amount of noise. Take PFR2 for example in sheet [N_L15_Pos_Context10_PFR3]. For a completely noiseless signal all 52,976 peptides at PFR2 would have the same AA e.g., 'P'. However, all 20 AAs are present at PFR2, albeit at different frequencies. Nonetheless, the ratio (observed/expected) suggests that P is present at this location not by chance. The expected number of Ps at position PFR2 is 3,311 but we observe 5,467. Our challenge is that there is no clear motif. That is, although P is the most frequent AA at PFR2, only 10.3% of the peptides in our training data have P at this location (7.8% for A, 7.6% G etc.) The expected reason for only obtaining 57% accuracy from the ML models is the excessive noise e.g., 93.2% accuracy can be achieved if 'P' is the only AA at PFR2 (100% accuracy was not obtained because some negative peptides also have a P at PFR2). This latter high accuracy supports that the default ML parameters, the number of features, and the methodology used are fit for the purpose Further analysis of the training data is required. For example, it may be informative to determine how many of the 52,976 peptides have different AAs to that expected at P10 and PFR1-3. Then analyse whether there are other informative differences (other than AAs) between those 'meeting' and those 'not meeting' expectations. Maybe above or below expectation values can be used as predictors in the training data. Also, it may be worth investigating whether there is any chemical and/or physical commonality between the most frequent AAs that do not meet expectations e.g., are the AA frequencies that exceed expectation hydrophilic? Then maybe chemical and physical properties could be used for the predictors.
The following ML predictor types for the context and flanking regions of MHCII peptides were tested using 70% / 30% data split of the training data, but the ML prediction accuracy averaged only around 57% • Amino acids e.g., P1 = 15 (for C), P2= 20 (for P) • Ratios (observed/expected) e.g., P1 = 0.974 (for L), P2 = 0.888 (for S) • AA property types (i.e., grouping amino acids on properties such as Charge, Polar, Amphipathic, and Hydrophobic) e.g., P1 = 1 (for Charged R K D E), P2 = 4 (for Hydrophobic A I L F V P G) and also grouping amino acids based on R Groups The source for the training data was reanalysed. In IEDB, there are 627,721 MHC II related ligands (peptides) associated with 687 publications. The publication years range from 1989 to 2022. The question is how reliable are these peptides? How can we determine which of the 627,721 peptides are bona fide peptides? The expectation is that peptides determined in 1989 would not be as accurate as those from 2022.
The above 70%/30% tests were repeated but only with peptides determined in the last 10 years (13,333 peptides). The results improved slightly but still averaged only around 60% accuracy. It is still debatable, however, whether the slight improvement was gained by weeding out possible inaccurate peptides.
Given the the13,333 peptides, many peptides are associated with one or more publications in. For example, one peptide from the same protein is associated with 20 publications (the highest noted association); and 1,910 (14.3%) peptides are associated with only one publication (the lowest noted association). Our assumption is that peptides associated with several publications are possibly more reliable than those with fewer publications. With this in mind, the 13,333 dataset was weighted as follows: each peptide was duplicated according to the number of associated publications e.g., the peptide associated with the 20 publications was duplicated so that a further 19 identical peptides were added to the 13,333; for peptides associated with 19 publications, 18 identical peptides were added to the total and so.
The above 70%/30% tests were repeated and a well over 90% accuracy was obtained. However, our approach of weighting the data may possibly be too subjective and needs to be evaluated by machine learning experts. The results have, however, raised some interesting points. For example, each group of identical peptides represent less than 1% of the total number, but collectively the introduction of these identical peptides clearly boosts the signal for ML. Furthermore, the context and flanking regions of the 13,333 peptides were clustered based on sequence similarity. Based on 40% similarity, 103 clusters (groups) were generated with 497 members in the largest group and only three in the smallest. If the 70%/30% tests are performed using the members from each group to create 103 training models (i.e., the positive dataset only contains the number of members e.g., 497 will be the largest training dataset), the accuracy is over 80% for the larger membership models. This is not proposed as a solution, but it seems to support the idea that there is no universal cleavage signal and that there are possible natural groups of peptides.
All the Literature was note fully investigated on cleavage site prediction from the endocytic proteases' perceptive, it seems to suggest, "the three main classes of intracellular proteases residing in the lysosomal/endosomal compartments and participating in antigen degradation are 1) cysteine (cathepsin B, F, H, L, S, Z, and AEP, for asparaginylendopeptidase); 2) aspartate (cathepsin D, E); and 3) serine (cathepsin A, G) proteases". The latter statement implies that there should be three main groups of peptides distinguished by the amino acids C, D, and S, respectively in the vicinity of the natural cleavage site. Unfortunately, this was not the case when the 627,721 peptides from IEDB were analysed, and IEDB does not make any association between proteases and peptides. Elsewhere in the Literature it also states, "Most endocytic proteases display broad cleavage specificity". Although this seems to be a contradiction to the previously described three classes of intracellular proteases, we believe the way forward is to find the 'appropriate' grouping of the source peptides to create separate training models e.g., searching for amino acid cleavage motifs for each cathepsin class. The aim would be to create an ML model for each class. From our searching so far, there appears to be no record in any peptide database of which cathepsin contributed to the known peptide.

Appendixes Appendix A: Text analysis
Test data Downloaded from IEDB: 34 linear peptides known to be MHC II ligands based on 70 assays associated with 13 publications from 1991 to 2018 (5 of the 13 are more than 20 years old).

MHCII-NP (2018)
Source: http://tools.iedb.org/mhciinp/ Online program to predict which MHC II ligands are generated as a result of natural antigen processing.

Dataset:
14,000 naturally processed ligands identified by mass spectrometry of peptides eluted from MHC class II expressing. Data downloaded from IEDB. It is not clear which organism was the host for the antigen processing. I think that it should be only be one host e.g., human, Methodology: Identify sequence signatures potentially related to the cleavage mechanisms that liberate the presented peptides from their source antigens (no machine learning used) The sequence regions analysed included 10 residues prior and past the N-and C-terminuses of each ligand including residues at both termini.

Methodology:
Uses a neural network trained on 46 different physicochemical descriptors of the cleavage site amino acids Uses a short sequence of seven residues to generate predictions: four residues placed before and three after the cleavage site. Each residue is encoded with 46 amino acid descriptors (describing steric, electrostatic and hydrophobic properties), resulting in a total amount of 336 descriptors for each short sequence i.e., 7aa * 46 = 322?? Extra 14?
ITCell (Integrative T-cell epitope prediction) (2018) Source: http://salilab.org/itcell This online program predicts cleavage sites, peptide-MHC II binding, and TCR epitope recognition. The output does not display the cleavage site location The program requires the user to specify the MHC type or upload their own peptide-MHC structure (or model) in PDB format AND the TCR structure in a PDB format i.e., it predicts T-cell epitopes, given the antigen, MHCII, and TCR sequences.
Methodology for cleavage sites: Cleavage sites are predicted in the antigen sequence based on cathepsins S, B, and H cleavage profiles. These sites are used to filter the set of all possible 12-mer peptides for the next step (peptide-MHCII predictions).
Cathepsin S has a preference for hydrophobic residues in the P2 position and for positively charged residues in the P1 position, consistent with the profiling of the P1-P4 positions using a combinatorial library. The remaining positions have broad specificity, consistent with the major role of this enzyme in antigen processing.
The cleavage site predictor was built for each cathepsin individually. First, the MSP-MS results were summarized in a matrix that contained amino acid residue counts from the cleaved peptides for each of the eight positions (P4 to P4'). This matrix was then used to compute a Hidden Markov Model (HMM) for cleavage prediction. The residue counts were converted to probabilities, normalized by the natural occurrence of amino acid residues; and the score, cleavage_score, is the sum of the log of normalized probabilities. All octapeptides in the antigen sequence are considered for proteolytic cleavage at the P1-P1'position. For a confident cleavage prediction, we require cleavage_score > 3.0, corresponding to at least three times higher likelihood of cleavage than by chance. We first consider cleavage by endopeptidases cathepsins B and S, followed by the aminopeptidase cathepsin H for additional trimming at the N-terminus; we do not use the cathepsin H profile for endoprotease cleavage, because our specificity matrix was constructed using only aminopeptidase cleavage sites.

Questions:
Is the proteolytic break down of a specific protein always the same each time i.e., do we always end up with the same number of peptides with the same sequence? Is the breakdown different for different host organisms e.g., given the same protein, end up with a different set of peptides in a cow as opposed to a human?

Is the breakdown different for different populations of humans or individuals?
Are peptides trimmed for presentation after protein is fragmented? Notes: In the main endocytic pathway of antigen presentation, extracellular antigens are internalized by phagocytosis, macropinocytosis, or receptor-mediated endocytosis, and degraded in acidic and proteolytic compartments such as lysosomes or late endosomes by proteases generally called cathepsins.
Factors beyond MHC class II binding affinity contribute to likelihood of a peptide to be recognized by CD4+ T cells i.e., predicting an MHCII binding peptide does not always equate to a conclusive prediction of a T-cell epitope.
It is widely recognized that natural processing shapes which peptides are available for binding to MHC molecules and subsequently presented to T cells, and that the capacity to bind MHC molecule is a necessary but not sufficient requisite for immunogenicity.